Always set the directory where you will be working, and load the packages (libraries) that you will need.
rm(list=ls()) # clear all variables in the memory
setwd(path_to_directory) # set the directory
# Load the libraries that you are going to need
library(RColorBrewer) # nice colors
library(ISLR)
cols <- brewer.pal(9, "Set1") # nice color palette
Session environment:
getwd() # where am I?
ls() # list R objects
dir() # list files in the directory
rm(list=ls()) # clear all variables in the memory
rm('variable_name') # remove only one or a few specified variables
You can get help with:
help(ls)
or ?ls
help(package = stats)
Google is your best friend: try also googling “<topic> in R”.
Useful tip: tab to autocomplete functions, to see their arguments and to navigate through directories
Assignment operation:
a = 1
b <- 2
3 -> c
a; b; c
[1] 1
[1] 2
[1] 3
Note: the assignment operator <- can seem strange to some, great to others. If you feel more comfortable, remember that you can still use =. Do not confuse either with the equality operator ==.
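A small example makes the distinction concrete: = and <- store a value, while == tests for equality and returns a logical.

```r
a <- 5    # assignment: a now holds 5
a == 5    # equality test: TRUE, and a is not modified
a == 6    # FALSE
a         # still 5
```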
R can interpret expressions
x <- 3
x
[1] 3
and compute mathematical expressions
y <- 2 * x + 4
y
[1] 10
Special values are NA (missing value), NaN (not a number), +Inf, and -Inf.
vec_na <- c(NA, 3, 4, +Inf)
sum(vec_na)
[1] NA
is.na(vec_na)
[1] TRUE FALSE FALSE FALSE
Remember the effect of “forbidden” operations
+Inf - Inf
[1] NaN
3/0
[1] Inf
-3/0
[1] -Inf
0/0
[1] NaN
Here is a list of the syntax for common mathematical operations. This is not meant to be an exhaustive list.
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | element-wise multiplication |
/ | element-wise division |
%*% | matrix multiplication |
^ | exponentiation |
%% | modulo division |
< | smaller than |
<= | smaller than or equal to |
> | larger than |
>= | larger than or equal to |
== | equal |
!= | not equal |
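A few of these operators in action:

```r
7 %% 3              # modulo: remainder of 7 divided by 3
c(1, 2) * c(3, 4)   # element-wise multiplication
2^10                # exponentiation
3 != 4              # not equal: TRUE
```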
Let us define a vector:
z <- rep(1, 10) # repeats 1 ten times
We can loop over some variable using a for loop:
for (i in 3:10){
  z[i] <- z[i - 1] + z[i - 2]
}
z
[1] 1 1 2 3 5 8 13 21 34 55
We can also branch the code in chunks according to some criterion, using the if-then-else statement:
if ( (2 + 1) == 3 ) print("yes") else print("no")
[1] "yes"
Please remember that indenting your code is essential (some programming languages will even throw errors for bad indentation). Careful formatting is key to good code.
for (i in 1:10){
  if (i == 5){
    cat(i, " ")
  } else {
    cat("Not five ")
  }
}
Not five Not five Not five Not five 5 Not five Not five Not five Not five Not five
The statements if and for are sufficient for the vast majority of programs. However, in a few instances you might find other constructs more comfortable, such as the while loop:
i <- 7
while (i){
  z[i] <- z[i] / 2
  i <- i - 1
  if (i < 3){
    break # break exits the loop early
  }
}
z # elements 3:7 were divided by 2
[1] 1.0 1.0 1.0 1.5 2.5 4.0 6.5 21.0 34.0 55.0
while is less common, but useful when the number of iterations is not known in advance.
You can use read.csv for a .csv file, or read.table for a generic delimited text file such as a .txt. Both convert the data to the data frame type (more on that later).
Auto <- read.csv("./Data/Auto.csv")
Auto[1:4,]
mpg cylinders displacement horsepower weight acceleration year origin
1 18 8 307 130 3504 12.0 70 1
2 15 8 350 165 3693 11.5 70 1
3 18 8 318 150 3436 11.0 70 1
4 16 8 304 150 3433 12.0 70 1
name
1 chevrolet chevelle malibu
2 buick skylark 320
3 plymouth satellite
4 amc rebel sst
dim(Auto)
[1] 397 9
Auto <- na.omit(Auto) # removes NA from data frame
dim(Auto)
[1] 397 9
names(Auto)
[1] "mpg" "cylinders" "displacement" "horsepower" "weight"
[6] "acceleration" "year" "origin" "name"
Conversely, write.csv can be used to write a data frame from R to a .csv file.
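For example, a minimal round trip (the file name my_data.csv is just an illustration):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
write.csv(df, "my_data.csv", row.names = FALSE) # row.names = FALSE avoids a spurious index column
df2 <- read.csv("my_data.csv")                  # read it back in
```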
Vectors can be defined in several ways. One example is the function seq, which takes as input the bounds of the vector and the step size.
vec <- seq(1, 9, by = 0.1)
vec
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8
[20] 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7
[39] 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6
[58] 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5
[77] 8.6 8.7 8.8 8.9 9.0
vec <- 1:9 # equivalent to seq(1, 9, by = 1)
vec
[1] 1 2 3 4 5 6 7 8 9
Alternatively, it can take as input the bounds of the vector and its length.
vec <- seq(1, 9, length.out = 100) # useful for fine grids
vec
[1] 1.000000 1.080808 1.161616 1.242424 1.323232 1.404040 1.484848 1.565657
[9] 1.646465 1.727273 1.808081 1.888889 1.969697 2.050505 2.131313 2.212121
[17] 2.292929 2.373737 2.454545 2.535354 2.616162 2.696970 2.777778 2.858586
[25] 2.939394 3.020202 3.101010 3.181818 3.262626 3.343434 3.424242 3.505051
[33] 3.585859 3.666667 3.747475 3.828283 3.909091 3.989899 4.070707 4.151515
[41] 4.232323 4.313131 4.393939 4.474747 4.555556 4.636364 4.717172 4.797980
[49] 4.878788 4.959596 5.040404 5.121212 5.202020 5.282828 5.363636 5.444444
[57] 5.525253 5.606061 5.686869 5.767677 5.848485 5.929293 6.010101 6.090909
[65] 6.171717 6.252525 6.333333 6.414141 6.494949 6.575758 6.656566 6.737374
[73] 6.818182 6.898990 6.979798 7.060606 7.141414 7.222222 7.303030 7.383838
[81] 7.464646 7.545455 7.626263 7.707071 7.787879 7.868687 7.949495 8.030303
[89] 8.111111 8.191919 8.272727 8.353535 8.434343 8.515152 8.595960 8.676768
[97] 8.757576 8.838384 8.919192 9.000000
One of the most used functions is c(), which stands for combine (or concatenate). You can concatenate vectors with vectors, scalars with scalars, or vectors with scalars.
A <- c(1, 6, 3)
B <- c(0, -2, 1)
vec <- c(A, 2, B)
vec
[1] 1 6 3 2 0 -2 1
And, of course, you can combine all the functions previously seen.
vec <- c(rep(1, 10), rep(2, 20))
vec
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
A <- matrix(1:16, 4, 4)
A
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Oops, maybe we actually wanted to fill the matrix by row!
A <- matrix(1:16, 4, 4, byrow = T)
A
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
Two fundamentally important functions are cbind() and rbind(), which bind matrices together by column and by row, respectively.
B <- matrix(runif(8), 4, 2, byrow = T)
cbind(A, B)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 0.4430378 0.7645879
[2,] 5 6 7 8 0.5332281 0.9026283
[3,] 9 10 11 12 0.2495218 0.9820944
[4,] 13 14 15 16 0.1327472 0.6736971
C <- matrix(runif(8), 2, 4, byrow = T)
rbind(A, C)
[,1] [,2] [,3] [,4]
[1,] 1.0000000 2.0000000 3.0000000 4.0000000
[2,] 5.0000000 6.0000000 7.0000000 8.0000000
[3,] 9.0000000 10.0000000 11.0000000 12.0000000
[4,] 13.0000000 14.0000000 15.0000000 16.0000000
[5,] 0.8161726 0.9740909 0.1279643 0.9867436
[6,] 0.5706480 0.5150818 0.8515889 0.7046885
We can also define arrays. These are generalizations of vectors and matrices to an arbitrary number of dimensions. A vector is a 1D array, a matrix is a 2D array. They are also called tensors.
D <- array(rnorm(2*4*5), dim = c(2, 4, 5))
D
, , 1
[,1] [,2] [,3] [,4]
[1,] -1.1494837 0.2753644 -0.6069892 -0.05844021
[2,] 0.1860709 0.6370736 0.8431723 -0.26942701
, , 2
[,1] [,2] [,3] [,4]
[1,] 0.4040653 -0.04382134 -0.2218624 0.1630243
[2,] 2.2414516 0.05005394 -1.1278691 0.4233376
, , 3
[,1] [,2] [,3] [,4]
[1,] -0.4949221 -0.5609446 1.75961921 -1.9711666
[2,] 0.7641867 0.5045475 -0.07788115 -0.5476058
, , 4
[,1] [,2] [,3] [,4]
[1,] 0.1495078 -1.636989 -1.7244224 0.2492688
[2,] -1.2017316 -1.202879 0.2139477 -1.0317071
, , 5
[,1] [,2] [,3] [,4]
[1,] -1.8509644 -0.7258099 -0.89236383 0.03814286
[2,] 0.2015316 -0.8681740 -0.08186036 -1.27583796
A[2,3] # selects the element in row 2, column 3
[1] 7
A[c(1,3),c(2,4)] # selects first and third row, second and fourth column
[,1] [,2]
[1,] 2 4
[2,] 10 12
A[1:3,2:4]
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 6 7 8
[3,] 10 11 12
A[1:2,] # the comma with a blank space selects all the columns
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
A[,1:2] # the comma with a blank space selects all the rows
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 9 10
[4,] 13 14
A[-c(1,3),] # removes first and third row
[,1] [,2] [,3] [,4]
[1,] 5 6 7 8
[2,] 13 14 15 16
A[-c(1,3),-c(1,3,4)] # removes the specified rows and columns
[1] 6 14
Indexing generic arrays is done in the same way.
D[1,3,2]
[1] -0.2218624
D[,3,]
[,1] [,2] [,3] [,4] [,5]
[1,] -0.6069892 -0.2218624 1.75961921 -1.7244224 -0.89236383
[2,] 0.8431723 -1.1278691 -0.07788115 0.2139477 -0.08186036
Data frames are essentially lists of vectors of equal length. The vectors can be of different types (numeric, factor, character, …). This is the most common format for data.
a <- 1:3
b <- factor(1:3)
c <- letters[1:3]
x <- data.frame(a = a, b = b, c = c)
x
a b c
1 1 1 a
2 2 2 b
3 3 3 c
We can inspect the names of the variables.
names(x)
[1] "a" "b" "c"
Use list operators to extract columns.
x$a
[1] 1 2 3
x$b # a factor
[1] 1 2 3
Levels: 1 2 3
x[["c"]] # a character vector; note the double-bracket list indexing
[1] "a" "b" "c"
Use matrix indexing, too!
x[1:2, 2:3]
b c
1 1 a
2 2 b
We can subset some rows and select the corresponding columns:
y <- subset(x, b %in% 2:3, select = c(b, c))
y
b c
2 2 b
3 3 c
There are lots of new fancy packages for manipulating data frames: see reshape, plyr, dplyr, sqldf, and others.
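As a taste, the subset call above can be rewritten with dplyr verbs (a sketch, assuming the dplyr package is installed):

```r
library(dplyr)

x <- data.frame(a = 1:3, b = factor(1:3), c = letters[1:3])
y <- x %>%
  filter(b %in% 2:3) %>% # keep rows where b is 2 or 3
  select(b, c)           # keep only columns b and c
y
```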
We can define a vector of characters denoting the birthplace of some individuals.
birthplace <- c("Austin","Austin","Houston","Houston","Austin","Dallas","Houston","Austin","Dallas","Houston","Austin")
Now we need to convert this vector to a factor (categorical vector), explicitly providing all the possible levels.
birthplace <- factor(birthplace, levels = c('Austin','Houston','Dallas', 'San Antonio')) # note that in our fake data nobody was born in San Antonio
plot(birthplace) # barplot of the frequencies
abs_freq <- table(birthplace)
abs_freq
birthplace
Austin Houston Dallas San Antonio
5 4 2 0
rel_freq <- table(birthplace)/length(birthplace)
rel_freq
birthplace
Austin Houston Dallas San Antonio
0.4545455 0.3636364 0.1818182 0.0000000
round(rel_freq, 2)
birthplace
Austin Houston Dallas San Antonio
0.45 0.36 0.18 0.00
R does linear algebra very efficiently.
x <- 1:3 # [1, 2, 3]
x %*% x # inner product: equivalent to t(x) %*% x
[,1]
[1,] 14
x %*% t(x) # outer product
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
There are two functions that are more computationally efficient: crossprod(x) and tcrossprod(x). The first computes t(x) %*% x and the second x %*% t(x).
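A quick check that they agree with the explicit products:

```r
v <- 1:3
M <- matrix(1:6, 2, 3)
crossprod(v)                         # 1x1 matrix: t(v) %*% v = 14
all.equal(crossprod(M), t(M) %*% M)  # TRUE
all.equal(tcrossprod(M), M %*% t(M)) # TRUE
```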
A <- matrix(1:9, 3, 3, byrow = T)
A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
A %*% x # matrix - vector product
[,1]
[1,] 14
[2,] 32
[3,] 50
A %*% A # matrix - matrix product
[,1] [,2] [,3]
[1,] 30 36 42
[2,] 66 81 96
[3,] 102 126 150
eigen(A) # eigenvalue decomposition
eigen() decomposition
$values
[1] 1.611684e+01 -1.116844e+00 -1.303678e-15
$vectors
[,1] [,2] [,3]
[1,] -0.2319707 -0.78583024 0.4082483
[2,] -0.5253221 -0.08675134 -0.8164966
[3,] -0.8186735 0.61232756 0.4082483
We can also compute the sum of the elements of a matrix
sum(A)
[1] 45
the sum along the rows
rowSums(A)
[1] 6 15 24
and the sum along the columns.
colSums(A)
[1] 12 15 18
For more general operations along rows and columns, use apply.
apply(X = A, MARGIN = 1, FUN = var) # along rows
[1] 1 1 1
apply(X = A, MARGIN = 2, FUN = var) # along columns
[1] 9 9 9
An advantage of doing linear algebra in R is that these vectorized operations call optimized compiled routines under the hood, so they are much faster than explicit loops.
We start with the normal distribution.
x <- rnorm(50) # with no additional arguments, rnorm draws from the standard normal distribution
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x,y)
[1] 0.9959406
set.seed(123)
y <- rnorm(50)
cor(x,y)
[1] 0.02657538
mean(y)
[1] 0.03440355
var(y)
[1] 0.8572352
sqrt(var(y))
[1] 0.92587
sd(y)
[1] 0.92587
Note that rerunning the same lines gives different results: these functions involve randomness. To fix the outcomes and obtain consistent results across runs and machines, set the random number generator's seed with set.seed().
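For example, two runs with the same seed produce identical draws, while continuing the stream without reseeding does not:

```r
set.seed(42)
draw1 <- rnorm(5)
set.seed(42)
draw2 <- rnorm(5)   # same seed: identical to draw1
draw3 <- rnorm(5)   # stream continues: differs from draw1
identical(draw1, draw2)
identical(draw1, draw3)
```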
We have also encountered the Bernoulli distribution.
x <- sample(x = 0:1, size = 1000, replace = TRUE, prob = c(0.5, 0.5))
barplot(table(x))
mean(x)
[1] 0.507
var(x)
[1] 0.2502012
R expressions are vectorized.
x <- 1:5
x
[1] 1 2 3 4 5
y <- 2 * x
y
[1] 2 4 6 8 10
The rule is that the expression is evaluated for each element.
z <- 1:5
2 * z
[1] 2 4 6 8 10
for (i in 1:5){
z[i] <- 2 * z[i]
}
z
[1] 2 4 6 8 10
Functions may or may not return a value for each element of an input vector.
sqrt(1:5)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
sum(1:5)
[1] 15
summary(1:5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3 4 5
Writing functions is fundamental for well-structured code.
f <- function(a, b = 2, c = NULL){
res <- a * b
if (!is.null(c)) res <- res * c
return(res)
}
f(1, 2, 3)
[1] 6
In the previous function definition, note the role of the default values.
f(4)
[1] 8
Let’s take a look at how R handles the scope of functions. Scope refers to the visibility of variables.
c <- "mice" # global
f <- function(a = "three") # function formals
{
b <- "blind" # function body
return(paste(a, b, c)) # three scopes
}
f()
[1] "three blind mice"
If an object is not found in the current scope, R searches upward through the enclosing scopes (this can be dangerous).
x <- 2 # global scope
f <- function() x <- 2 * x # 2 different x's here
x
[1] 2
f()
x
[1] 2
The assignment in the function body creates a local variable x whose scope is the function body, not outside of it.
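If you really do want to modify a variable in an enclosing scope from inside a function, R provides the superassignment operator <<- (use it sparingly, since it makes code harder to reason about):

```r
x <- 2
f <- function() x <<- 2 * x   # <<- assigns in the enclosing (here: global) environment
f()
x                             # now 4
```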
RColorBrewer is an excellent package! But if you are nerdy enough, you can also consult this color Bible. Unfortunately, the basic colors in R are pretty bad. See for example:
plot(1:9, 1:9, pch = 16, col = 1:9)
You can also use different symbols to differentiate the points.
plot(1:20, 1:20, pch = 1:20)
Look how much nicer the RColorBrewer palettes are! Some of them are recommended for color blindness.
library(RColorBrewer)
display.brewer.all()
cols <- brewer.pal(9, "Set1")
plot(1:9, 1:9, pch = 16, col = cols[1:9])
set.seed(123)
x <- rnorm(100) # 100 draws from a standard normal distribution
y <- rnorm(100)
plot(x, y, col = 'green') # you can also use the numbers to define the colors, as seen in the previous example
plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis",
main = "Plot of X vs Y", col = cols[3])
plot(x, y, col = cols[3], pch = 16)
We can also add a line with the command abline(), which takes the argument v = for a vertical line, h = for a horizontal line, and a =, b = for intercept and slope, respectively.
plot(x, y, col = cols[3], pch = 16)
abline(h = 0, lty = 2, lwd = 2, col = cols[1])
Let us now see two very useful plotting functions: lines() and hist(). First, we generate observations from a normal distribution using rnorm().
samp <- rnorm(n = 5000, mean = 3, sd = 2)
hist(samp, col = 'gray', border = 'white', probability = T, nclass = 30, main = 'Histogram', xlab = '')
x_grid <- seq(min(samp), max(samp), length.out = 100)
lines(x_grid, dnorm(x_grid, mean = 3, sd = 2), lwd = 2, col = cols[1])
Note how we overlay a curve on top of the histogram: we evaluate the function on a fine grid of points on the x-axis, and connecting the resulting (x, y) points approximates the function!
plot(x_grid, dnorm(x_grid, mean = 3, sd = 2), lwd = 2, col = cols[1],
main = 'How to plot functions', xlab = 'x', ylab = 'f(x)')
Two dimensional plots are a bit more complex. For example, say that we want to compute the contour lines of the function \(f(x, y) = \cos(y) / (1 + x^{2})\), on the domain \(x \in [-\pi, \pi], y \in [-\pi, \pi]\). We can represent the function surface in different ways.
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x,y){cos(y)/(1+x^2)})
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = T)
fa <- (f - t(f))/2
contour(x, y, fa, nlevels = 15)
image(x, y, fa)
persp(x, y, fa)
persp(x, y, fa, theta = 30)
persp(x, y, fa, theta = 30, phi = 40)
ggplot2
The same things can be done with the awesome ggplot2 package. It is just a matter of taste: I personally prefer it, but I realize that it might be cumbersome for some.
library(reshape2)
library(ggplot2)
set.seed(123)
x <- rnorm(100) # 100 draws from a standard normal distribution
y <- rnorm(100)
ggplot() +
geom_point(aes(x = x, y = y), col = cols[3]) +
labs(x = "X", y = "Y", title = "Plot of X vs Y")
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x,y){cos(y)/(1+x^2)})
plot_f <- melt(f)
ggplot() +
geom_contour(aes(x = Var1, y = Var2, z = value), data = plot_f) +
labs(x = "x", y = "y") +
theme(legend.title=element_blank())
fa <- (f - t(f))/2
plot_fa <- melt(fa)
ggplot() +
geom_contour(aes(x = Var1, y = Var2, z = value), data = plot_fa) +
labs(x = "x", y = "y") +
theme(legend.title=element_blank())
ggplot() +
geom_raster(aes(x = Var1, y = Var2, fill = value), data = plot_fa) +
scale_fill_viridis_c(option = 'D') +
labs(x = "x", y = "y") +
theme(legend.title=element_blank())
Write an R function that returns the first \(n\) Fibonacci numbers.
Write a script to plot the probability density functions of three random variables: \(X_{1} \sim N(0, 3^2)\), \(X_{2} \sim N(10, 0.5^2)\), \(X_{3} \sim N(-2, 1^2)\).
This exercise has the goal of getting familiar with the central limit theorem (CLT). Suppose \(X_1, X_2, \dots, X_n\) is a sequence of i.i.d. random variables (not necessarily normal) with well-defined mean \(\mu\) and variance \(\sigma^2\). Then, for large \(n\), the sample mean \[\bar{X}_n = \frac{X_1 + \dots + X_n}{n}\] is approximately distributed as \(\mathcal{N}(\mu, \sigma^2 / n)\).
For this exercise, sample \(100\) observations from a Bernoulli experiment with parameter \(p = 0.2\), and compute their sample mean with the function mean(). Can you repeat this process \(1000\) times? Remember how to use the for loop. Now take the \(1000\) generated sample means and plot them on a histogram. Is the shape familiar? Can you overlay the theoretical normal density that the CLT predicts? You will need the mean and variance of a Bernoulli random variable.